Datasource : https://open-data.bouldercolorado.gov/datasets/10c2e57f741b45428c492f67aeb98b1b_0/explore
HomePage : https://johnkimaiyo.vercel.app/
Python Projects : https://johnkimaiyo.vercel.app/Pages/Python%20Projects/Python.html
# Load in some packages
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# load dataset
boulder_df =pd.read_csv(r"C:\Users\jki\Downloads\Accounts_Payable (1).csv")
boulder_df.head(6)
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | PEASE, JUDITH L | 2024/01/16 | 246.45 | 18550010 | 1100 | General | 185 | Finance | 670395 | Misc Purchased Services | 1 |
1 | PUSH PEDAL PULL, INC | 2024/01/15 | 470.00 | 51523032 | 2300 | Recreation Activity | 515 | Parks and Recreation | 670395 | Misc Purchased Services | 2 |
2 | ECOARTS CONNECTIONS INC | 2024/01/16 | 700.00 | 95111004 | 1100 | General | 951 | Fundwide / Citywide | 670395 | Misc Purchased Services | 3 |
3 | XCEL ENERGY | 2024/01/16 | 12.69 | 61015020 | 6100 | Water Utility | 610 | Public Works - Utilities | 680210 | Gas & Electric | 4 |
4 | PUSH PEDAL PULL, INC | 2024/01/15 | 691.64 | 51523031 | 2300 | Recreation Activity | 515 | Parks and Recreation | 630250 | Materials: Equipment | 5 |
5 | GREATER WESTERN FENCE LLC | 2024/01/15 | 2770.00 | 55520180 | 2500 | Open Space | 555 | Open Space & Mountain Parks | 660800 | R&M: Land Improvements | 6 |
After getting a sense of the data's structure, it is a good idea to look at a statistical summary of the variables with df.describe()
boulder_df.describe()
transaction_amount | organization_code | fund | department | object_code | ObjectId | |
---|---|---|---|---|---|---|
count | 3.815620e+05 | 3.815620e+05 | 381562.000000 | 381562.000000 | 381562.000000 | 381562.000000 |
mean | 6.838728e+03 | 4.093152e+07 | 2992.008067 | 408.574481 | 649140.119310 | 190781.500000 |
std | 1.101914e+05 | 1.740279e+07 | 2209.419437 | 173.147329 | 75219.369195 | 110147.606043 |
min | -8.122976e+05 | 1.000000e+04 | 1.000000 | 0.000000 | 105200.000000 | 1.000000 |
25% | 5.915250e+01 | 3.101502e+07 | 1100.000000 | 310.000000 | 630310.000000 | 95391.250000 |
50% | 2.409200e+02 | 4.101005e+07 | 2200.000000 | 410.000000 | 660815.000000 | 190781.500000 |
75% | 1.028215e+03 | 5.552016e+07 | 6100.000000 | 555.000000 | 670395.000000 | 286171.750000 |
max | 3.682732e+07 | 9.716700e+07 | 8120.000000 | 951.000000 | 892101.000000 | 381562.000000 |
# lets check for missing values
missing_values = boulder_df.isna().sum()
print(missing_values)
vendor_name 0 transaction_date 0 transaction_amount 0 organization_code 0 fund 0 fund_desc 0 department 0 department_desc 8803 object_code 0 account_description 0 ObjectId 0 dtype: int64
# lets check the data types
boulder_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 381562 entries, 0 to 381561 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 vendor_name 381562 non-null object 1 transaction_date 381562 non-null object 2 transaction_amount 381562 non-null float64 3 organization_code 381562 non-null int64 4 fund 381562 non-null int64 5 fund_desc 381562 non-null object 6 department 381562 non-null int64 7 department_desc 372759 non-null object 8 object_code 381562 non-null int64 9 account_description 381562 non-null object 10 ObjectId 381562 non-null int64 dtypes: float64(1), int64(5), object(5) memory usage: 32.0+ MB
# lets remove missing values
boulder_df.dropna(subset=['department_desc'],inplace =True)
# lets confirm existance of missing values
missing_values =boulder_df.isna().sum()
print(missing_values)
vendor_name 0 transaction_date 0 transaction_amount 0 organization_code 0 fund 0 fund_desc 0 department 0 department_desc 0 object_code 0 account_description 0 ObjectId 0 dtype: int64
# lets change the date data type
boulder_df['Date'] = pd.to_datetime(boulder_df['transaction_date'])
boulder_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 372759 entries, 0 to 381559 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 vendor_name 372759 non-null object 1 transaction_date 372759 non-null object 2 transaction_amount 372759 non-null int32 3 organization_code 372759 non-null int64 4 fund 372759 non-null int64 5 fund_desc 372759 non-null object 6 department 372759 non-null int64 7 department_desc 372759 non-null object 8 object_code 372759 non-null int64 9 account_description 372759 non-null object 10 ObjectId 372759 non-null int64 11 Date 372759 non-null datetime64[ns] 12 Month 372759 non-null int32 13 Month Name 372759 non-null object dtypes: datetime64[ns](1), int32(2), int64(5), object(6) memory usage: 39.8+ MB
# Let's create a new variable, Month, from 'Order Date':
boulder_df['Month'] = boulder_df['Date'].dt.month
boulder_df['Month'].describe()
count 372759.000000 mean 6.594961 std 3.431795 min 1.000000 25% 4.000000 50% 7.000000 75% 10.000000 max 12.000000 Name: Month, dtype: float64
# lets change to taransaction amoount to integer
boulder_df['transaction_amount'] = boulder_df['transaction_amount'].astype(int)
boulder_df['transaction_amount'].info()
<class 'pandas.core.series.Series'> Int64Index: 372759 entries, 0 to 381559 Series name: transaction_amount Non-Null Count Dtype -------------- ----- 372759 non-null int32 dtypes: int32(1) memory usage: 4.3 MB
boulder_df.head(5)
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | Date | Month | Month Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PEASE, JUDITH L | 2024/01/16 | 246 | 18550010 | 1100 | General | 185 | Finance | 670395 | Misc Purchased Services | 1 | 2024-01-16 | 1 | Jan |
1 | PUSH PEDAL PULL, INC | 2024/01/15 | 470 | 51523032 | 2300 | Recreation Activity | 515 | Parks and Recreation | 670395 | Misc Purchased Services | 2 | 2024-01-15 | 1 | Jan |
2 | ECOARTS CONNECTIONS INC | 2024/01/16 | 700 | 95111004 | 1100 | General | 951 | Fundwide / Citywide | 670395 | Misc Purchased Services | 3 | 2024-01-16 | 1 | Jan |
3 | XCEL ENERGY | 2024/01/16 | 12 | 61015020 | 6100 | Water Utility | 610 | Public Works - Utilities | 680210 | Gas & Electric | 4 | 2024-01-16 | 1 | Jan |
4 | PUSH PEDAL PULL, INC | 2024/01/15 | 691 | 51523031 | 2300 | Recreation Activity | 515 | Parks and Recreation | 630250 | Materials: Equipment | 5 | 2024-01-15 | 1 | Jan |
# Load in some packages
import calendar
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
from collections import Counter
warnings.filterwarnings("ignore")
# 1. What was the expensive month How much was expensed for that month?
# Replace NaN or inf values in the 'Month' column with a default value (e.g., 0)
boulder_df['Month'] = boulder_df['Month'].fillna(0).astype(int)
# Convert month numbers to abbreviated month names
boulder_df['Month Name'] = boulder_df['Month'].apply(lambda x: calendar.month_abbr[x])
# Group by month and calculate total sales for each month
expense_by_month = boulder_df.groupby('Month Name').sum()['transaction_amount']
# Find the highest month for expense
highest_month = expense_by_month.idxmax()
expense_for_best_month = expense_by_month.max()
print(f"The Highest month for highest expense was {highest_month} with expense of ${expense_for_best_month:,.2f}")
The Highest month for highest expense was Nov with expense of $259,930,331.00
# Plot the highest expense for each vendor
expense_by_vendor = boulder_df.groupby('vendor_name').sum()['transaction_amount']
# Sort the values in descending order and select the top five
top_five_expense_by_vendor= expense_by_vendor.sort_values(ascending=False).head(5)
# Display the result
print(top_five_expense_by_vendor)
# Plot the top five County
top_five_expense_by_vendor.plot(kind='bar', color='brown', figsize=(10, 6))
plt.title('Top Five Vendors with the highest expense account')
plt.xlabel('Vendor Name')
plt.ylabel('Expense')
plt.show()
vendor_name US BANK 242598382 FIDELITY NATIONAL TITLE COMPANY 93114355 LAND TITLE GUARANTEE COMPANY 53115030 NORTHERN COLORADO WATER CONSERVANCY DISTRICT 49781435 XCEL ENERGY 40165691 Name: transaction_amount, dtype: int32
# Plot the highest expense
expense_by_cost_description = boulder_df.groupby('fund_desc').sum()['transaction_amount']
# Sort the values in descending order and select the top five
top_five_expense_by_cost_description = expense_by_cost_description.sort_values(ascending=False).head(5)
# Display the result
print(top_five_expense_by_cost_description)
# Plot the top five County
top_five_expense_by_cost_description.plot(kind='bar', color='yellow', figsize=(10, 6))
plt.title('Top Five Cost with the highest expense account')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
fund_desc General 320767316 Water Utility 310597008 Transportation 222160900 Wastewater Utility 160834236 Open Space 134172438 Name: transaction_amount, dtype: int32
# Plot the highest expense for each department
expense_by_department = boulder_df.groupby('department_desc').sum()['transaction_amount']
# Sort the values in descending order and select the top five
top_five_expense_by_department = expense_by_department.sort_values(ascending=False).head(5)
# Display the result
print(top_five_expense_by_department)
# Plot the top five County
top_five_expense_by_department.plot(kind='bar', color='purple', figsize=(10, 6))
plt.title('Top Five department with the highest expense account')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
department_desc Public Works - Utilities 563521652 Public Works - Transportation 242354042 Fundwide / Citywide 228828532 Facilities & Fleet 172466157 Open Space & Mountain Parks 138763898 Name: transaction_amount, dtype: int32
# Plot the highest expense for each Centers
expense_by_cost_centers = boulder_df.groupby('account_description').sum()['transaction_amount']
# Sort the values in descending order and select the top five
top_five_expense_by_cost_centers = expense_by_cost_centers.sort_values(ascending=False).head(5)
# Display the result
print(top_five_expense_by_cost_centers)
# Plot the top five County
top_five_expense_by_department.plot(kind='bar', color='green', figsize=(10, 6))
plt.title('Top Five Centers with the highest expense account')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
account_description Capital: Utility Infrastrcture 220308868 Community Funding/Grants 201435391 Capital: Infrastructure 197742579 Misc Purchased Services 177944536 Capital: Real Estate Acq 110795689 Name: transaction_amount, dtype: int32
boulder_df['vendor_name'].value_counts()
PCARD ONETIME VENDOR 44295 MCGUCKIN HARDWARE 12695 XCEL ENERGY 10719 HOME DEPOT USA INC 7257 CENTURYLINK 4484 ... LEMUS, SONNY 1 PADILLA, JOCELYN MARIE BUCHER 1 PADILLA, ISABEL G BUCHER 1 STADSKLEV, ERIKA 1 LARRY RODGERS DESIGN GROUP 1 Name: vendor_name, Length: 8066, dtype: int64
expense_by_vendor = boulder_df['vendor_name'].value_counts()
# Sort the values in descending order and select the top five
top_five_expense_by_vendor = expense_by_vendor.sort_values(ascending=False).head(5)
# Display the result
print(top_five_expense_by_vendor )
# Plot the top five County
top_five_expense_by_vendor .plot(kind='bar', color='green', figsize=(10, 6))
plt.title('Top Five Vendors Commonly Used')
plt.xlabel('fund_description')
plt.ylabel('Expense')
plt.show()
PCARD ONETIME VENDOR 44295 MCGUCKIN HARDWARE 12695 XCEL ENERGY 10719 HOME DEPOT USA INC 7257 CENTURYLINK 4484 Name: vendor_name, dtype: int64
boulder_df['transaction_amount'].mean()
rounded_mean = round(boulder_df['transaction_amount'].mean())
print(rounded_mean)
5431
boulder_df[boulder_df.where(boulder_df['transaction_amount']>5431,other=0).all(1)]
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | Date | Month | Month Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | IMPACT CHARITABLE | 2024/01/15 | 6000 | 95111004 | 1100 | General | 951 | Fundwide / Citywide | 670395 | Misc Purchased Services | 10 | 2024-01-15 | 1 | Jan |
54 | CITY OF LONGMONT | 2024/01/12 | 25645 | 40515060 | 2910 | Comm Development Block Grant | 405 | Housing | 670184 | Community Funding/Grants | 55 | 2024-01-12 | 1 | Jan |
62 | BROWN AND CALDWELL | 2024/01/12 | 40303 | 61090010 | 6100 | Water Utility | 610 | Public Works - Utilities | 810730 | Capital: Utility Infrastrcture | 63 | 2024-01-12 | 1 | Jan |
64 | MOUNTAIN VIEW INTEGRATION LLC | 2024/01/12 | 11249 | 31510040 | 1100 | General | 315 | Facilities & Fleet | 660878 | R&M: Software Maint Agreements | 65 | 2024-01-12 | 1 | Jan |
78 | TIERRA ROJO CORPORATION | 2024/01/12 | 6065 | 14524430 | 2400 | Climate Tax Fund | 145 | Climate Initiatives | 670184 | Community Funding/Grants | 79 | 2024-01-12 | 1 | Jan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
381485 | OUTPUT SERVICES INC | 2014/12/20 | 9500 | 18520040 | 1100 | General | 185 | Finance | 620118 | Postage and Express Mail | 381486 | 2014-12-20 | 12 | Dec |
381490 | NORTHERN COLORADO WATER CONSERVANCY DISTRICT | 2014/12/19 | 500000 | 61090010 | 6100 | Water Utility | 610 | Public Works - Utilities | 670078 | Water & Ditch Assmts: Other | 381491 | 2014-12-19 | 12 | Dec |
381500 | WSP USA ENVIRONMENT & INFRASTRUCTURE INC | 2014/12/18 | 9976 | 61010010 | 6100 | Water Utility | 610 | Public Works - Utilities | 650720 | Engineering Consultants | 381501 | 2014-12-18 | 12 | Dec |
381519 | INTEGRATING TECHNOLOGY & STANDARDS INC | 2014/01/29 | 132961 | 61090010 | 6100 | Water Utility | 610 | Public Works - Utilities | 810730 | Capital: Utility Infrastrcture | 381520 | 2014-01-29 | 1 | Jan |
381559 | LARRY RODGERS DESIGN GROUP | 2005/05/26 | 10000 | 51595030 | 3300 | Permanent Parks & Recreation | 515 | Parks and Recreation | 650710 | Architectural Consultants | 381560 | 2005-05-26 | 5 | May |
32428 rows × 14 columns
f1 = boulder_df['transaction_amount']>5431
f2 = boulder_df['vendor_name']=="PCARD ONETIME VENDOR"
boulder_df[boulder_df.where(f1& f2,other=0).all(1)]
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | Date | Month | Month Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2625 | PCARD ONETIME VENDOR | 2023/12/26 | 6386 | 25510010 | 1100 | General | 255 | Police | 710500 | Business Travel | 2626 | 2023-12-26 | 12 | Dec |
2628 | PCARD ONETIME VENDOR | 2023/12/26 | 7683 | 55520220 | 2500 | Open Space | 555 | Open Space & Mountain Parks | 630265 | Materials: Tools | 2629 | 2023-12-26 | 12 | Dec |
2703 | PCARD ONETIME VENDOR | 2023/12/26 | 8763 | 25510040 | 1100 | General | 255 | Police | 630250 | Materials: Equipment | 2704 | 2023-12-26 | 12 | Dec |
2707 | PCARD ONETIME VENDOR | 2023/12/26 | 11105 | 61040060 | 6200 | Wastewater Utility | 610 | Public Works - Utilities | 660866 | R&M: Mach & Equip | 2708 | 2023-12-26 | 12 | Dec |
2709 | PCARD ONETIME VENDOR | 2023/12/26 | 22335 | 19015050 | 7300 | Computer Replacement Fund | 190 | Information Technology | 690325 | Cellular Phone Service | 2710 | 2023-12-26 | 12 | Dec |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
360506 | PCARD ONETIME VENDOR | 2015/06/25 | 6198 | 31065010 | 1100 | General | 310 | Public Works - Transportation | 630699 | Materials: Miscellaneous | 360507 | 2015-06-25 | 6 | Jun |
360507 | PCARD ONETIME VENDOR | 2015/06/25 | 24588 | 41005010 | 1100 | General | 410 | Human Services | 630275 | Materials: Furniture | 360508 | 2015-06-25 | 6 | Jun |
368776 | PCARD ONETIME VENDOR | 2015/04/25 | 8250 | 61060080 | 6300 | Stormwater & Flood Mgt Utility | 610 | Public Works - Utilities | 670395 | Misc Purchased Services | 368777 | 2015-04-25 | 4 | Apr |
368781 | PCARD ONETIME VENDOR | 2015/04/25 | 35449 | 18530030 | 7120 | Workers Compensation Ins | 185 | Finance | 630699 | Materials: Miscellaneous | 368782 | 2015-04-25 | 4 | Apr |
372484 | PCARD ONETIME VENDOR | 2015/03/25 | 5500 | 25515100 | 1100 | General | 255 | Police | 630245 | Materials: Computer Software | 372485 | 2015-03-25 | 3 | Mar |
334 rows × 14 columns
boulder_df.nlargest(3,'transaction_amount')
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | Date | Month | Month Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
337818 | FIDELITY NATIONAL TITLE COMPANY | 2015/12/04 | 36827319 | 95101671 | 6700 | BMPA Debt Service | 951 | Fundwide / Citywide | 810710 | Capital: Real Estate Acq | 337819 | 2015-12-04 | 12 | Dec |
222630 | NORTHERN COLORADO WATER CONSERVANCY DISTRICT | 2018/04/25 | 32677237 | 61090010 | 6100 | Water Utility | 610 | Public Works - Utilities | 810730 | Capital: Utility Infrastrcture | 222631 | 2018-04-25 | 4 | Apr |
154948 | FIDELITY NATIONAL TITLE COMPANY | 2019/11/06 | 9500000 | 40515040 | 2140 | Affordable Housing | 405 | Housing | 670184 | Community Funding/Grants | 154949 | 2019-11-06 | 11 | Nov |
boulder_df.nsmallest(3,'transaction_amount')
vendor_name | transaction_date | transaction_amount | organization_code | fund | fund_desc | department | department_desc | object_code | account_description | ObjectId | Date | Month | Month Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
259495 | PCARD ONETIME VENDOR | 2017/07/25 | -812297 | 18505005 | 1100 | General | 185 | Finance | 670395 | Misc Purchased Services | 259496 | 2017-07-25 | 7 | Jul |
267358 | PCARD ONETIME VENDOR | 2017/05/25 | -612064 | 18505005 | 1100 | General | 185 | Finance | 670395 | Misc Purchased Services | 267359 | 2017-05-25 | 5 | May |
255394 | PCARD ONETIME VENDOR | 2017/08/25 | -552550 | 18505005 | 1100 | General | 185 | Finance | 670395 | Misc Purchased Services | 255395 | 2017-08-25 | 8 | Aug |